# -*- coding: utf-8 -*-
# Author:   $Author: merkosh $
# Revision: $Rev: 57 $

Howto parse
===========

This document briefly covers two methods of parsing an HTML-encoded
web page in order to extract field information.


Related information:
--------------------

For information on program invocation by lmc or on the return format,
please see Format.txt.


Deep parsing vs. shallow parsing
--------------------------------

By deep parsing we mean building a tree-like structure that assigns
parent, child and sibling relations to HTML tags, typically in
accordance with some W3C DTD.

By shallow parsing we mean searching for and extracting text
information regardless of its parent-child relationships, skipping
unwanted tags in between, etc.
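
As an illustration, shallow parsing can be sketched in a few lines of
Python. This is only a minimal sketch under stated assumptions: the page
fragment, the "Title:" label and the helper name extract_field are all
hypothetical and not taken from any existing script.

```python
import re

# Hypothetical page fragment; the markup around the field does not matter.
HTML = ('<tr><td><b>Title:</b></td>'
        '<td><span class="x">The Matrix</span></td></tr>')


def extract_field(html, label):
    """Return the first text that follows `label`, skipping tags in between."""
    # The label, then any run of tags and whitespace, then text up to the
    # next tag. Parent-child relations are ignored entirely.
    pattern = re.escape(label) + r'(?:\s*<[^>]*>)*\s*([^<]+)'
    match = re.search(pattern, html)
    return match.group(1).strip() if match else None


print(extract_field(HTML, "Title:"))
```

The same call keeps working if the site wraps the value in additional
tags or drops some of them, because only the order of label and text
matters.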


Why not to use deep parsing
---------------------------

HTML is usually formatted so that the user can visually perceive the
information well. It is structured according to a DTD, not according to
the contents you intend to extract.
E.g. consider the <span></span> tag, which is almost never visible, but
     introduces different sibling relations.
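
The effect of such an invisible tag can be made concrete with a small
Python sketch. It records the chain of open tags above each piece of
text, using the standard library's html.parser module. The two
fragments and the names PathRecorder and text_paths are hypothetical
and serve only as an illustration.

```python
from html.parser import HTMLParser


class PathRecorder(HTMLParser):
    """Record the chain of open tags above each non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = {}

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths[text] = "/".join(self.stack)


def text_paths(html):
    recorder = PathRecorder()
    recorder.feed(html)
    recorder.close()
    return recorder.paths


plain = "<td><b>Year:</b> 1999</td>"
wrapped = "<td><b>Year:</b> <span>1999</span></td>"

print(text_paths(plain)["1999"])    # td
print(text_paths(wrapped)["1999"])  # td/span
```

Both fragments look the same in a browser, yet the value sits at a
different position in the tree. Code that navigates by tree position
has to be adapted, while a search for the text after "Year:" does not.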

Websites that use a script backend usually produce their output in a
systematic way. However, small irregularities (e.g. faulty syntax)
might not be visible in your browser, yet your parser still has to cope
with them.

Deep parsing does not structure what you are looking for. It structures
the HTML code. 


When to use deep parsing
------------------------

Deep parsing is appropriate when you are working directly with XML and
an associated DTD/Schema and can access your field information with an
XML parser, or when your parser provides functions that let you access
the information you require faster and more conveniently.

Example:
The script amazon-en.pl uses the Perl HTML::TreeParser object to extract
all hyperlinks. This works because, although Amazon's HTML structure is
horrible, the embedded hyperlinks are structured in such a way that all
necessary information can be extracted directly from them.
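
The idea of letting a parser collect the hyperlinks and then reading
the fields out of the links themselves can be sketched in Python as
follows. This is not the code of amazon-en.pl: it uses the standard
library's event-based HTMLParser instead of a Perl tree parser, and the
page fragment, the URL layout and the names LinkCollector,
extract_links and link_fields are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


def link_fields(url):
    """Split a link into its path and its query parameters."""
    parts = urlparse(url)
    return parts.path, {k: v[0] for k, v in parse_qs(parts.query).items()}


# Hypothetical fragment with an unclosed <p>; the links are still found.
PAGE = ('<div><a href="/item/123?title=Foo&amp;year=1999">Foo</a>'
        '<p><a href="/item/456?title=Bar&amp;year=2001">Bar</a></div>')

for link in extract_links(PAGE):
    print(link_fields(link))
```

The surrounding markup is never inspected. Only the links matter, so
the parser is used here purely as a convenient way to get at them.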

Rule of thumb: do not make it nice, make it work.


